-
-
- CHAPTER FIVE
-
- CD-ROM PRODUCTION ISSUES
-
-
-
- CD-ROM DATA COMPRESSION
-
- Dr Nicholas Beser
- Applied Physics Laboratory
- Johns Hopkins University
-
-
-
-
- This paper was presented with the overhead visuals listed below.
-
- %g BES01.pcx;
- %g BES02.pcx;
- %g BES03.pcx;
- %g BES04.pcx;
- %g BES05.pcx;
- %g BES06.pcx;
- %g BES07.pcx;
- %g BES08.pcx;
- %g BES09.pcx;
- %g BES10.pcx;
- %g BES11.pcx;
- %g BES12.pcx;
- %g BES13.pcx;
- %g BES14.pcx;
- %g BES15.pcx;
- %g BES16.pcx;
- %g BES17.pcx;
- %g BES18.pcx;
- %g BES19.pcx;
- %g BES20.pcx;
- %g BES21.pcx;
- %g BES22.pcx;
- %g BES23.pcx;
-
-
-
-
- ISO 9660 STANDARDS: A LAYMAN'S INTERPRETATION
-
- Dr Roger Hutchison
- President, CD-ROM Inc.
-
-
- ISO 9660 is the International Organization for
- Standardization (ISO) standard titled "Information
- processing -- Volume and file structure of CD-ROM for
- information interchange." The first edition was
- published on April 4, 1988. The reference number for
- people wishing to buy the document is ISO 9660: 1988 (E).
- It is available for $50.00 from:
-
- American National Standards Institute
- 11 West 42nd Street
- New York NY 10036
-
- This document, which is only 29 pages long, describes the
- volume and file structure for CD-ROM discs. The
- International Standard specifies:
-
- the attributes of the volume and the
- descriptors placed on it
-
- the relationship among volumes of a
- volume set
-
- the placement of the files
-
- the attributes of the files; record
- structures intended for use in the
- input or output data streams of an
- application program when such data
- streams are required to be organized
- as sets of records
-
- three nested levels of medium
- interchange
-
- two nested levels of implementation
-
- requirements for the processes which
- are provided within information
- processing systems, to enable
- information to be interchanged
- between different systems, utilizing
- recorded CD-ROM as the medium of
- interchange. For this purpose it
- specifies the functions to be
- provided within systems which are
- intended to originate or receive CD-
- ROM which conform to the
- international standard
-
- Sections two through five of the ISO document cover
- conformance issues and the definitions on which the
- conformance levels rest. Section two, item six lays out
- the requirements for the medium in terms of the volume
- structure. This is where the physical addresses, the
- logical sectors and the volume space are described and
- defined to comply with international standards.
-
- The Physical Address: Each physical sector of
- the disc is identifiable by a unique physical
- address, as specified in the relevant standard
- for recording.
-
- The Logical Sector: Each logical sector on a
- CD-ROM disc holds 2,048 bytes of information.
- There are eight bits in one byte, so each
- sector holds eight times 2,048, or 16,384, bits
- of information. Each logical sector has a
- unique, identifiable address, represented by a
- unique number. This is one key element of what
- makes CD-ROM distinctive and valuable for
- massive databases.
-
- The Volume Space: The information on a volume
- is recorded in the set of all logical sectors
- on that volume. This set is referred to as the
- Volume Space of the disc.
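-
- As a minimal sketch of the addressing idea above, the
- Python below converts a logical sector number into a byte
- position and back. It assumes the 2,048-byte sector size
- given in the text and a disc image stored as consecutive
- sectors starting at sector 0. The function names are
- illustrative and are not taken from the standard.
-
```python
SECTOR_BYTES = 2048   # logical sector size given in the text
BITS_PER_BYTE = 8


def sector_bits() -> int:
    """Number of bits held by one logical sector."""
    return SECTOR_BYTES * BITS_PER_BYTE


def sector_to_byte_offset(sector_number: int) -> int:
    """Byte position where a logical sector starts (assumes sector 0 is at byte 0)."""
    if sector_number < 0:
        raise ValueError("sector numbers start at 0")
    return sector_number * SECTOR_BYTES


def byte_offset_to_sector(offset: int) -> tuple[int, int]:
    """Return (sector number, byte position inside that sector) for a byte offset."""
    if offset < 0:
        raise ValueError("offsets start at 0")
    return divmod(offset, SECTOR_BYTES)


if __name__ == "__main__":
    print(sector_bits())               # 16384
    print(sector_to_byte_offset(16))   # 32768
```
-
- Because every sector has the same size, finding a sector
- is a single multiplication, which is what makes the
- numbered address so easy to "drive" to.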
-
- What all this means in layman's terms is the
- following. Imagine a long highway, say, 3.5 miles long.
- Now give each foot of the highway a unique address. In
- other words, "mile one, foot 1,253" is an address. You
- can drive to this unique spot in a matter of minutes,
- depending on how fast your car runs. The address on the
- "mile-foot" map could be written 1:1253 or in some other
- arbitrary form. The important thing here is that the
- address is identifiable, unique and findable in your
- address system. Another important thing is that you can
- get there from here if you know the address. The faster
- your car is, the faster you can "drive" to that address.
- With CD-ROM, the address system is much the same
- but, of course, we are dealing with microscopic units of
- measurement, located by a laser device whose speed of
- movement determines the access time. We can navigate in
- milliseconds on the surface of the CD-ROM disc because
- of this unique mapping system. Also, the "road" on a
- CD-ROM disc is one long continuous spiral starting from
- the inside of the disc and moving outwards. The spiral
- is roughly 3.5 miles long, but the laser head can jump
- to almost any spot on the disc by moving straight across
- the spiral until it finds the track where it is supposed
- to be. It then slows down, finds the sector and reads
- the data.
- The ISO 9660 standard is the road map for us to
- follow as we interchange data among nations that are
- diverse both geographically and politically. It crosses
- language barriers in all countries and is the single
- most important reason why CD-ROM, as a technology, can
- be exchanged universally. A CD-ROM disc made in the USA
- can be read by people in France, Italy and Botswana. The
- same is not true for a VCR tape, a television camera or
- a Betamax tape.
- The remainder of the ISO document simply defines
- more technical ways to navigate on the surface of the
- disc. Really, that is all there is to it!
-
-
- REFERENCE
-
- ISO 9660: 1988(E) Information Processing - Volume and
- File Structure of CD-ROM for Information Interchange.
-
-
- A CD-ROM MAINTENANCE INFORMATION SYSTEM
- FOR THE GENERAL AVIATION INDUSTRY
-
- Michael Sandifer
- Aircraft Technical Publishers
- Brisbane CA
-
-
- This paper was presented with the overhead visuals listed below.
-
-
- %g SAN02.pcx;
- %g SAN03.pcx;
- %g SAN04.pcx;
- %g SAN05.pcx;
- %g SAN06.pcx;
- %g SAN07.pcx;
- %g SAN08.pcx;
- %g SAN09.pcx;
- %g SAN10.pcx;
- %g SAN11.pcx;
- %g SAN12.pcx;
- %g SAN13.pcx;
- %g SAN14.pcx;
- %g SAN15.pcx;
- %g SAN16.pcx;
- %g SAN17.pcx;
- %g SAN18.pcx;
- %g SAN19.pcx;
-
-
- DATA CAPTURE: THE INS AND OUTS, DO'S AND DON'TS
-
- William Thornburg
- Director of Marketing
- Reference Technology Inc.
-
-
- Mark Twain was always hearing from people who claimed to
- be his double and he got tired of writing to these people
- and explaining to them they couldn't possibly be his
- double. So he had a letter run off, had a bunch of copies
- made and sent a form letter in response to these claims.
- I will read you the letter:
-
-
- My dear sir, Thank you very much for your
- letter and photograph. In my opinion you are
- more like me than any of my numerous doubles.
- I may even say you resemble me even more
- closely than I do myself. In fact I intend to
- use your picture to shave by. Yours
- thankfully, S. Clemens
-
- Data capture is like that - you would like to end up with
- a picture that you can shave by. In fact I think if you
- did shave by it you would end up with a few nicks. The
- real question in data capture is how many nicks you are
- willing to live with. Judy Zidar (National Agricultural
- Library) mentioned that data capture is the most
- expensive part of a CD-ROM project. I'll second that. The
- logical formatting part of CD-ROM mastering is usually
- what people think of when they think of making a CD-ROM.
- That's the piece that costs a couple of thousand dollars
- and takes a few days' time. You can measure it in hours
- of effort. The next step is the indexing step and you
- usually measure that in tens of hours: another order of
- magnitude more expensive and time consuming. The next
- step - data conversion - is usually measured in hundreds
- of hours: once again, another order of magnitude more
- expensive and time consuming. If you happen to be
- unlucky enough to have to do data capture, you are going
- to spend significant effort, in both time and money,
- capturing the data.
-
-
- PRICE QUOTES
-
- Price quotes are one way to evaluate capture vendors.
- Page scanning is usually quite straightforward: it's
- quoted at a price per page. Prices per page using a
- sub-contractor usually range anywhere from fifteen cents
- to a dollar, depending mostly on quality. It's one of
- the significant cost factors. When we talk about quality
- we are really talking about the amount of skew on the
- page, how well aligned the type is on the page and the
- readability of the text or the pictures on the page. One
- thing to be aware of in price quotes is media deposit
- fees. There is usually a media deposit fee, and it's
- sometimes refundable and sometimes not. As an example,
- we recently completed a relatively large CD-ROM project
- which included significant data capture. We selected a
- capture vendor and then discovered that they could only
- supply us data on 1600 BPI tapes. The project ended up
- entailing thousands of tapes which, had they not been
- returnable, would have meant thousands of additional
- dollars in media.
-
-
- TEXT CAPTURE
-
- Capturing the text off the page is the most complex part
- of data capture. As a consumer of data capture services
- we've found that, comparing keystroking with OCR,
- keystroking tends to produce higher accuracy and is
- cheaper if source documents can be sent offshore. When I
- talk about sending keystroking offshore it usually means
- India, the Philippines, China, and sometimes Mexico. The
- real question is whether your data is free of security
- restrictions and can be sent offshore. If it can't be
- sent offshore the costs are roughly the same between
- keystroking and OCR. The costs are usually measured per
- thousand keystrokes. The rates are usually eighty cents
- to three dollars per thousand keystrokes. The factors
- behind the difference in cost are quality, volume and
- very often turn-around - how soon you need the data
- back.
- One of the critical parameters in your cost is how
- many characters you have on a page. It's very easy to
- count the characters on a single page, but until you do
- the keystroking, unless you have a lot of patience, you
- don't know how many total characters you have. That makes
- budgeting very difficult. Normally we try to pick out
- several random pages, count the characters, multiply the
- average by the number of pages and hope for the best. A
- lot of capture vendors use file size as a measure of the
- number of keystrokes they have captured. The thing to be
- aware of is that if your data comes back in some word
- processing format such as WordPerfect, a lot of
- additional coding gets inserted into the file. If file
- size is used to count keystrokes then you will be
- charged for these additional characters. Again this can
- inflate your costs.
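-
- The page-sampling estimate described above can be
- sketched in a few lines of Python. This is only a rough
- illustration: the function name and the sample figures in
- the usage line are made up, and the default rates are the
- eighty cents to three dollars per thousand keystrokes
- quoted earlier.
-
```python
from statistics import mean


def estimate_keystroke_cost(sample_page_counts, total_pages,
                            rate_low=0.80, rate_high=3.00):
    """Estimate total keystrokes and a low/high cost range.

    sample_page_counts: characters counted on a few randomly chosen pages.
    total_pages: number of pages in the whole job.
    rate_low, rate_high: dollars per thousand keystrokes (range from the text).
    """
    if not sample_page_counts or total_pages <= 0:
        raise ValueError("need at least one sampled page and a positive page count")
    estimated_keystrokes = mean(sample_page_counts) * total_pages
    thousands = estimated_keystrokes / 1000
    return estimated_keystrokes, thousands * rate_low, thousands * rate_high


if __name__ == "__main__":
    # Hypothetical job: three sampled pages, 1,000 pages in total.
    keystrokes, low, high = estimate_keystroke_cost([2000, 2400, 2200], 1000)
    print(f"{keystrokes:,.0f} keystrokes, ${low:,.2f} to ${high:,.2f}")
```
-
- The spread between the low and high figures is a reminder
- that the estimate is only as good as the sampled pages
- and the rate finally agreed with the vendor.
-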
- Tabs are a very interesting part of keystroking
- because tabs can have a profound impact on tables. This
- sample table has about 370 characters if you ignore the
- spaces and tabs. If you use spaces to align the columns
- you have 633 spaces. If you use tabs you are going to
- use 24 tabs. We're talking about the difference between
- about 1000 keystrokes and 400 keystrokes. So the cost is
- about 250% of what it would be, or two and a half times
- as high, if you use spaces instead of tabs. Make sure
- directions to your keystrokers are very explicit
- regarding tables.
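-
- The arithmetic behind the table example can be checked
- with a short Python sketch. The counts are the ones given
- in the talk; the function name is illustrative.
-
```python
TEXT_CHARS = 370      # characters in the sample table, ignoring spaces and tabs
ALIGN_SPACES = 633    # spaces needed to align the columns
ALIGN_TABS = 24       # tabs needed to align the same columns


def table_keystrokes(text_chars: int, alignment_keystrokes: int) -> int:
    """Total billable keystrokes for a table: text plus alignment characters."""
    return text_chars + alignment_keystrokes


with_spaces = table_keystrokes(TEXT_CHARS, ALIGN_SPACES)   # 1003
with_tabs = table_keystrokes(TEXT_CHARS, ALIGN_TABS)       # 394
cost_ratio = with_spaces / with_tabs                       # roughly 2.5

if __name__ == "__main__":
    print(with_spaces, with_tabs, round(cost_ratio, 2))    # 1003 394 2.55
```
-
- At a fixed rate per thousand keystrokes, the space-aligned
- table costs roughly two and a half times as much as the
- tab-aligned one.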
-
-
- QUALITY
-
- You would like to have first generation documents. Second
- and third generation documents - photocopies - tend to
- lose resolution. You end up with poor quality scanned
- images and less readable characters. Paper stock is
- significant particularly when you are scanning because
- you get bleed-through from the print on the backside of
- the page. If you have very thin newsprint you can get
- bleed-through. If your source documents are microfiche or
- microfilm the normal route is that they are printed to
- paper and then the paper is captured electronically. A
- microfilm print usually is relatively low quality. The
- two areas that can get you into trouble with OCR are non-
- proportionally-spaced text and typographical effects such
- as using the vertical bars to outline tables. Those tend
- to fool a lot of the current OCR software.
- With graphics, paper stock and copy generation again
- come into play. Size comes into play if you have
- documents larger or smaller than 8.5 x 11 inches; these
- tend to cost more. With graphics capture most vendors
- like to take a stack of paper, put it in a page feeder
- and run it through a scanner, so over-sized or
- under-sized documents are going to influence your costs.
- If you have fine lines they can get lost in scanning. If
- you have half-tones or photographs there are various
- techniques that can be applied at scanning time to
- enhance the scan, but these techniques will also
- influence costs.
- Think about what you want to do with a scanned
- image. You normally want to present it to the user at a
- workstation, and eventually the user wants to print it.
- Most of the desktop laser printers today don't print
- within about 1/8" of the edge of a page. If you have
- source documents that are printed all the way up to the
- edge of the page you are likely to lose the border when
- they are printed on the laser printer. A fairly common
- trick is to photo-reduce the page before it goes through
- the scanner. Photo-reduce it to 98% of its original size
- and then you don't have to worry about the border around
- the edges. Again, these steps tend to drive your costs
- up.
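-
- As a rough worked example of the photo-reduction trick,
- the Python below computes how much blank margin a given
- reduction leaves on each edge. It assumes an 8.5 x 11
- inch page printed right up to the edge and a reduced
- image centered on the sheet; both the function names and
- those assumptions are illustrative.
-
```python
def margin_after_reduction(page_w: float, page_h: float, scale: float):
    """Blank margin (inches) left on each side and on top/bottom when a
    full-bleed page is reduced by `scale` and centered on the same sheet."""
    if not 0 < scale <= 1:
        raise ValueError("scale must be in (0, 1]")
    side = (page_w - page_w * scale) / 2
    top_bottom = (page_h - page_h * scale) / 2
    return side, top_bottom


def scale_for_margin(page_w: float, page_h: float, margin: float) -> float:
    """Largest reduction factor that leaves at least `margin` inches on every edge."""
    return min((page_w - 2 * margin) / page_w,
               (page_h - 2 * margin) / page_h)


if __name__ == "__main__":
    side, top_bottom = margin_after_reduction(8.5, 11.0, 0.98)
    print(f"98%: {side:.3f} in. at the sides, {top_bottom:.3f} in. top and bottom")
    print(f"factor for a full 1/8 in.: {scale_for_margin(8.5, 11.0, 0.125):.3f}")
```
-
- Under these assumptions a 98% reduction frees a little
- under 1/8" at the sides, so whether it is enough depends
- on the printer's actual unprintable border and on how
- close to the edge the source is really printed.
-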
- The other issue is skew. There are two things that
- cause skew. Often you are capturing from a printed book.
- The binding gets split off the book and then the ream of
- paper gets loaded into the sheet feeder of a page
- scanner. That cut edge is not going to be very even. If
- that edge happens to be the alignment edge you end up
- with things skewed. The other thing that causes skew is
- original documents that are printed skewed on the page.
- This most often happens when you are working from
- photocopies. It is very difficult for the machine to
- recognize skew. The only way we've found to do it is to
- have an operator look at those images.
- In addition there are the hidden costs. The product
- management costs occur in three areas: quality, content
- and validation.
-
-
- QUALITY AND CONTENT VALIDATION
-
- People pay attention to things that are measured. Capture
- vendors pay attention to what you measure. Therefore you
- want to measure the things that are important in your
- capture project. The things that are important are
- quality, quality, quality and then content.
- The real issue with validation is that you would like to
- make clear to the vendor how you are going to measure
- their quality and content. That way, if there's any
- disagreement about whether the relative quality level
- was met, you've already established how you were going
- to measure it and whether it is acceptable or not. Some
- of the techniques for validating quality and content are
- programmatic validation and spot checking. We use a suite
- of these things. We divide a very large capture project
- into batches. Then we apply some statistical sampling
- techniques to the batches. If the samples pass then we
- make the assumption, based on the sampling algorithms,
- that the whole batch is ok. And we establish these
- techniques with the vendor. The thing to be aware of in
- a data capture project is that finding errors is the
- hard part. Once you have found them, fixing them is very
- inexpensive. So it's unacceptable to say to the vendor
- "I found an error in this word, and I want you to fix
- it." The vendor will say "I'll fix the word and we'll be
- done with it." What you really want to say is, "I found
- an error in this batch and I want you to redo the
- batch."
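-
- The batch-and-sample approach can be sketched as follows.
- This is not the speaker's actual procedure, which the
- talk does not spell out; it is a minimal illustration in
- Python, assuming a simple random sample and a caller-
- supplied check (the hypothetical `is_correct` function)
- that says whether one captured record is acceptable.
-
```python
import random


def batch_passes(batch, is_correct, sample_size, max_errors=0, seed=None):
    """Accept or reject a whole batch based on a random sample of its records.

    batch: list of captured records.
    is_correct: function returning True when a record is acceptable.
    sample_size: how many records to inspect.
    max_errors: errors tolerated in the sample before the batch is rejected.
    """
    rng = random.Random(seed)
    sample = rng.sample(batch, min(sample_size, len(batch)))
    errors = sum(1 for record in sample if not is_correct(record))
    return errors <= max_errors


def batches_to_redo(batches, is_correct, sample_size, max_errors=0, seed=None):
    """Return the indexes of batches that should go back to the vendor whole."""
    return [i for i, batch in enumerate(batches)
            if not batch_passes(batch, is_correct, sample_size, max_errors, seed)]
```
-
- The point of returning whole batches rather than single
- records is the one made above: the vendor, not the
- customer, then carries the cost of finding the remaining
- errors.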
-
-
- BUDGETING DATA CAPTURE PROJECT
-
- Some of us have the luxury of a variable budget. Most of
- us don't. Living within a fixed cost budget can be fairly
- difficult since many cost components of a capture project
- are hard to predict. One recommendation we have is to
- think about budgeting for only part of the data. You have
- a fixed budget that applies to as much data as you can
- get through the process. Say you have a history of data
- and you would like to capture some number of years. You
- start with the most recent year and work backwards until
- your data capture dollars run out, and that's where you
- cut and make your CD-ROM disc. This is a fairly
- effective technique, because you don't have to budget as
- closely for unforeseen costs. And the more current data
- is usually the more valuable to users. The question, of
- course, is whether you can live with only part of the
- data.
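-
- A minimal Python sketch of this "work backwards until the
- money runs out" rule is shown below. The per-year costs
- in the usage example are invented for illustration, and
- the function name is an assumption.
-
```python
def years_within_budget(cost_by_year, budget):
    """Pick years from the most recent backwards until the budget is used up.

    cost_by_year: dict mapping year -> capture cost for that year's data.
    budget: fixed amount available for data capture.
    Returns (list of years captured, newest first; total spent).
    """
    chosen = []
    spent = 0.0
    for year in sorted(cost_by_year, reverse=True):
        cost = cost_by_year[year]
        if spent + cost > budget:
            break          # stop at the first year that no longer fits
        chosen.append(year)
        spent += cost
    return chosen, spent


if __name__ == "__main__":
    # Hypothetical costs per year of source material.
    costs = {1990: 4000, 1989: 5000, 1988: 6000, 1987: 3000}
    print(years_within_budget(costs, 10000))   # ([1990, 1989], 9000.0)
```
-
- The loop stops at the first year that does not fit rather
- than skipping it, so the disc always covers an unbroken
- run of the most recent years.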
-
-
- PROBLEMS IN SPECIFICATION OF PROJECT
-
- We highly recommend written specifications for capture
- projects. Even with written specifications, problems can
- occur. Here's an example of a paragraph with a dropped
- header. On the left hand side we have the header and
- beside it the text. When you give instructions to the
- keystroker, you want them to capture first the phrase in
- the paragraph header, followed by the text of the
- paragraph. The table below it looks exactly the same. It
- has left hand column information and table text. Unless
- you went through your entire set of source documents and
- marked them accordingly, your capture vendor wouldn't
- necessarily know how to key these things.
-
-
- CONCLUSION
-
- Data capture involves a series of tasks. You want to
- organize your raw data (separate, copy and name it),
- develop some tracking procedures, and figure out the
- tagging requirements for your text search engine. We
- suggest sending these to the capture vendor prior to
- starting the project. They can help identify problems or
- areas that are questionable.
- You will want to tie a graphic reference in the text
- to a specific graphic. This is normally done by
- developing graphic naming conventions. Normally the
- graphics scanning and data capture are two separate data
- streams, sometimes performed in altogether different
- locations. The streams come together at the end and are
- tied together via the naming conventions. If you get your
- naming conventions right, both locations name graphics
- the same and it all ties together.
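-
- One way to check that the two streams really do tie
- together is to compare the graphic names referenced in
- the captured text with the names of the scanned files.
- The Python sketch below borrows, purely as an example
- convention, the "%g NAME.pcx;" reference style used for
- the visuals in these proceedings; the pattern and the
- function name are assumptions, not the speaker's tooling.
-
```python
import re

# Illustrative reference pattern: "%g THO01.pcx;"
GRAPHIC_REF = re.compile(r"%g\s+([A-Za-z0-9_]+\.pcx);", re.IGNORECASE)


def check_graphic_links(text, scanned_files):
    """Compare graphic references in captured text with the scanned file names.

    Returns (referenced but not scanned, scanned but never referenced),
    both as sorted lists of upper-cased file names.
    """
    referenced = {name.upper() for name in GRAPHIC_REF.findall(text)}
    scanned = {name.upper() for name in scanned_files}
    return sorted(referenced - scanned), sorted(scanned - referenced)


if __name__ == "__main__":
    captured = "See the chart. %g THO01.pcx; Then the table. %g THO02.pcx;"
    print(check_graphic_links(captured, ["tho01.pcx", "THO03.PCX"]))
    # (['THO02.PCX'], ['THO03.PCX'])
```
-
- Either list being non-empty means the two locations did
- not apply the naming convention in the same way.
-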
- You'll want to do some sampling with the capture
- vendor. Look very carefully at the results of the
- samples. When you get the whole data set back the first
- thing you want to verify is content: that you have all
- the data you expected to get. When you get the bill you
- want to validate that it corresponds to the amount of
- text you had captured.
- Then you evaluate quality, which means trying to get
- some hint of the relative quality of the captured text.
- It is very difficult, incidentally, to establish that,
- and it's also expensive. So you will probably go to
- sampling techniques to establish it.
- Then you want to glue it together and take off with
- the steps Judy talked about. We are a development house
- that can help you with capture or entire CD-ROM projects.
- We also sell CD-ROM development software and have been
- involved in a lot of data capture projects. If you have
- an interest in this kind of work we'd love to talk to
- you.
-
- Graphics related to this paper:
-
- %g THO01.pcx;
- %g THO02.pcx;
- %g THO03.pcx;
- %g THO04.pcx;
- %g THO05.pcx;
- %g THO06.pcx;
- %g THO07.pcx;
- %g THO08.pcx;
- %g THO09.pcx;
- %g THO10.pcx;
- %g THO11.pcx;
- %g THO12.pcx;
- %g THO13.pcx;
-
-
-
-
- DATA CONVERSION: MOVING FROM PAPER TO PLASTIC
-
- Judith A. Zidar
- National Agricultural Text Digitizing Program
- National Agricultural Library, Beltsville MD
-
-
- The following is a summary of the data conversion process
- involved when moving information from paper to CD-ROM. It
- is assumed that such an effort proceeds according to
- established project management guidelines. That is, a
- project manager has been appointed and the project team
- assembled. A budget has been established, and a project
- plan with a time line has been developed. Even though no
- one actually sticks to the time table on their first
- CD-ROM project, such a tool helps to organize the various
- tasks and sets forth the expected flow of the work. Just
- be prepared to update the time line as the project moves
- along (or fails to move along, as will happen on
- occasion).
- It is also assumed that indexing and retrieval
- software has been selected, or that it will be selected
- sometime during the project. Selection of such software
- is not covered here.
-
-
- DATA CONVERSION PROCESS (DATABASE CREATION)
-
- The data conversion process as described here is based on
- the experience gained by the National Agricultural Text
- Digitizing Program (NATDP) at the National Agricultural
- Library. NATDP takes collections of reference materials
- on a single topic, such as Aquaculture or Food
- Irradiation, optically scans the material and performs
- text recognition on it, and then places the images and
- text on CD-ROM for distribution to the agricultural
- community. The eight steps given below would apply to
- most CD-ROM projects, although the tasks for each step
- may vary depending on the nature of each project.
-
-
- 1. DEFINE USER AND DATABASE REQUIREMENTS
-
- Who are the users? (novice, experts; subject knowledge)
- Why and how are they using it? (casual, research; browse,
- word search)
- How often are they going to use it? (occasional use
- requires a more intuitive interface than daily use)
- Need full text? Images?
-
- Developer's special requirements?
- Preservation issues.
- Legal requirements.
- Time and cost limitations.
-
- 2. COLLECT SOURCE DOCUMENTS
-
- Publications
- Manuscripts
- Machine-readable files
-
-
- 3. PREPARE SOURCE DOCUMENTS FOR PROCESSING (DATA PREP)
- AND DESIGN DATABASE
-
- NOTE: Data prep and database design are listed
- by most experts as two separate steps. In
- actual practice, however, we find ourselves
- doing them at the same time, as they go
- together like hand in glove.
-
- (1) Review and organize source material;
- assign sequence numbers.
-
- (2) Define contents; records; fields.
-
- (3) Mark up the source material, record by
- record.
-
- (4) Prepare worksheets or other tracking
- sheets.
-
- (5) Assign descriptors, other enhancements.
-
- (6) Determine file and directory naming scheme
- (40-100 files per directory).
-
- (7) Determine other files needed for database,
- such as thesaurus, hierarchy, or table of
- contents files. (May depend on retrieval
- software and on user needs.)
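-
- As a small illustration of item (6), the Python below
- spreads a list of file names over numbered directories so
- that none holds more than 100 files, the upper end of the
- range suggested above. The "DIR001"-style directory names
- are a made-up example, not a NATDP convention.
-
```python
def assign_directories(file_names, max_per_dir=100, prefix="DIR"):
    """Group file names into numbered directories of at most max_per_dir files.

    Returns a dict mapping directory name -> list of file names, in input order.
    """
    if max_per_dir < 1:
        raise ValueError("max_per_dir must be at least 1")
    layout = {}
    for index, name in enumerate(file_names):
        dir_name = f"{prefix}{index // max_per_dir + 1:03d}"
        layout.setdefault(dir_name, []).append(name)
    return layout


if __name__ == "__main__":
    files = [f"PAGE{n:04d}.TXT" for n in range(1, 251)]
    for directory, members in assign_directories(files).items():
        print(directory, len(members))
    # DIR001 100
    # DIR002 100
    # DIR003 50
```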
-
-
- 4. PERFORM DATA CAPTURE AND ENHANCEMENT
-
- NOTE: Data capture is the largest and most
- time consuming step in the entire process. It
- could be more economical to have a service
- bureau perform this step.
-
- (1) Scan images, OCR text, convert
- machine-readable files.
-
- (2a) Edit text as necessary.
- (2b) Add any enhancements, such as
- bibliographic data and descriptive
- information.
-
- (3) Add field tags or other codes; or assure
- format consistency so a program can be written
- to do this.
-
- (4) Back up files as they are processed, and/or
- archive to stable medium.
-
- (5) Maintain a logbook to track database files
- through processing.
-
-
- 5. CREATE LABEL ARTWORK
-
- (1) Get label artwork specifications from
- mastering facility.
-
- (2) Design artwork. Most labels include the
- following:
-
- (a) Background design.
- (b) Title of CD-ROM.
- (c) Developer agency's name and logo.
- (d) Standard CD-ROM logo; name of
- mastering facility (if required by the
- facility); and "Made in USA."
- (e) Retrieval software used to access the
- disc.
- (f) Publication date (usually, month and
- year mastered).
- (g) Other descriptive information you may
- want on your disc label.
-
- (3) Create camera-ready copy.
-
- (4) Create film positive, emulsion side up.
- (Mastering facility will do this for you for a
- small fee, if you cannot provide it.)
-
-
-
- 6. BUILD AND INDEX DATABASE
-
- NOTE: This step is best performed on a system
- with enough storage space to hold the entire
- database + software -- up to 650 MB on one
- partition.
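-
- A quick way to see whether a database will fit is to add
- up the file sizes, rounding each file up to whole
- sectors. The Python sketch below assumes 2,048-byte
- sectors, one megabyte equal to 1,048,576 bytes, and that
- every file starts on a sector boundary; it ignores the
- space taken by directories and volume descriptors, so
- treat the answer as optimistic.
-
```python
import math

SECTOR_BYTES = 2048
CAPACITY_BYTES = 650 * 1024 * 1024   # 650 MB, taking 1 MB as 2**20 bytes


def sectors_needed(file_sizes):
    """Total sectors used when each file is rounded up to whole sectors."""
    return sum(math.ceil(size / SECTOR_BYTES) for size in file_sizes)


def fits_on_disc(file_sizes, capacity_bytes=CAPACITY_BYTES):
    """True when the sector-rounded files fit within the stated capacity."""
    return sectors_needed(file_sizes) * SECTOR_BYTES <= capacity_bytes


if __name__ == "__main__":
    sizes = [1, 2048, 2049]            # bytes; hypothetical files
    print(sectors_needed(sizes))       # 4
    print(fits_on_disc(sizes))         # True
```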
-
- (1) Prepare text files for indexing.
- e.g.: Sort and/or move (both text & images)
- Insert links and tags (if not done at data
- capture)
- Concatenate text files if required by software
- Create "indexing" files (.def; .hir; .tbl)
- Validate files
-
- (2) Index textual material -- may index many
- times.
-
- (3) Create links, hyperlinks.
-
- (4) Add other database files (thesaurus, t/c,
- etc.)
-
- (5) Test and retest the database.
-
- (6) Create and test installation program for
- retrieval software. (Vendor may supply.)
-
- (7) Back up entire database w/software to
- stable medium.
-
- (8) Create logical format for ISO 9660. Use
- CD-Publisher or similar system; or list the
- directory and file structure for the mastering
- facility.
-
- (9) Write database w/software to 9-track tape
- or other portable medium (DAT, CD-R, WORM)
-
-
-
- 7. PREPARE END-USER DOCUMENTATION
-
- (1) Prepare User Manual -- software vendor
- often supplies.
-
- (2) Prepare quick-start tutorial or guided
- tour -- software vendor may supply; otherwise,
- prepare inhouse.
-
- (3) Have documentation reviewed by editorial
- committee, if required.
-
- (4) Send to printer.
-
-
-
- 8. MASTER CD-ROMS
-
- (1) Send tapes and artwork to mastering
- facility. (Government agencies can go through
- GPO or NTIS.)
-
- (2) Request test disc.
-
- (3) Review test disc.
-
- (4) Authorize creation of CD-ROMs.
-
-
-